Introduction to tidymodels

R. Dimas Bagas Herlambang

August 16, 2019

Disclaimer


The following presentation was produced by the team at Algoritma for its internal training. This presentation is intended for a restricted audience only. It may not be reproduced, distributed, translated, or adapted in any form outside these individuals and organizations without permission.

Outline

Why tidymodels Matters?

  • Things we think we’re doing right
  • Things we never thought we could do better

Setting the Cross-Validation Scheme using rsample

  • Rethinking: Why do we need validation?
  • Tidy way to rsample-ing your dataset

Data Preprocess using recipes

  • Rethinking: How should we treat train and test?
  • Reproducible preprocess recipes

Model Fitting using parsnip

  • Rethinking: How many machine learning packages have you used?
  • One vegeta.. I mean package to rule them all: parsnip

Model Evaluation using yardstick

  • Rethinking: How do we measure the goodness of our model?
  • It’s always good to bring your own yardstick

Dive Deeper: Summing-up the Baseline Workflow

Intermediate Workflow: Hyperparameter Tuning

Why tidymodels Matters?

Things we think we’re doing right

Sample splitting

Numeric scaling

Things we never thought we could do better

How we see model performance

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  1036  197
#>        yes  197   40
#>                                           
#>                Accuracy : 0.732           
#>                  95% CI : (0.7085, 0.7545)
#>     No Information Rate : 0.8388          
#>     P-Value [Acc > NIR] : 1               
#>                                           
#>                   Kappa : 0.009           
#>                                           
#>  Mcnemar's Test P-Value : 1               
#>                                           
#>             Sensitivity : 0.16878         
#>             Specificity : 0.84023         
#>          Pos Pred Value : 0.16878         
#>          Neg Pred Value : 0.84023         
#>              Prevalence : 0.16122         
#>          Detection Rate : 0.02721         
#>    Detection Prevalence : 0.16122         
#>       Balanced Accuracy : 0.50450         
#>                                           
#>        'Positive' Class : yes             
#> 

How we use the Receiver Operating Characteristic (ROC) curve

Setting the Cross-Validation Scheme using rsample

Rethinking: Why do we need validation?

Tidy way to rsample-ing your dataset

rsample is part of tidymodels that could help us in splitting or resampling our machine learning dataset.

There are so many splitting and resampling approaches provided by rsample–as you could see in its full function reference page. In this introduction, we will use the two most general functions:

Initial splitting

Initial random sampling for splitting train and test could be done using initial_split():

# set seed
set.seed(100)

# create initial split
splitted <- initial_split(attrition, prop = 0.8)

# check train dataset
training(splitted)
#> # A tibble: 1,177 x 35
#>    attrition   age business_travel daily_rate department distance_from_h…
#>    <chr>     <dbl> <chr>                <dbl> <chr>                 <dbl>
#>  1 yes          41 travel_rarely         1102 sales                     1
#>  2 no           49 travel_frequen…        279 research_…                8
#>  3 yes          37 travel_rarely         1373 research_…                2
#>  4 no           33 travel_frequen…       1392 research_…                3
#>  5 no           27 travel_rarely          591 research_…                2
#>  6 no           32 travel_frequen…       1005 research_…                2
#>  7 no           59 travel_rarely         1324 research_…                3
#>  8 no           30 travel_rarely         1358 research_…               24
#>  9 no           38 travel_frequen…        216 research_…               23
#> 10 no           36 travel_rarely         1299 research_…               27
#> # … with 1,167 more rows, and 29 more variables: education <dbl>,
#> #   education_field <chr>, employee_count <dbl>, employee_number <dbl>,
#> #   environment_satisfaction <dbl>, gender <chr>, hourly_rate <dbl>,
#> #   job_involvement <dbl>, job_level <dbl>, job_role <chr>,
#> #   job_satisfaction <dbl>, marital_status <chr>, monthly_income <dbl>,
#> #   monthly_rate <dbl>, num_companies_worked <dbl>, over_18 <chr>,
#> #   over_time <chr>, percent_salary_hike <dbl>, performance_rating <dbl>,
#> #   relationship_satisfaction <dbl>, standard_hours <dbl>,
#> #   stock_option_level <dbl>, total_working_years <dbl>,
#> #   training_times_last_year <dbl>, work_life_balance <dbl>,
#> #   years_at_company <dbl>, years_in_current_role <dbl>,
#> #   years_since_last_promotion <dbl>, years_with_curr_manager <dbl>

# check test dataset
testing(splitted)
#> # A tibble: 293 x 35
#>    attrition   age business_travel daily_rate department distance_from_h…
#>    <chr>     <dbl> <chr>                <dbl> <chr>                 <dbl>
#>  1 no           29 travel_rarely         1389 research_…               21
#>  2 no           38 travel_rarely          371 research_…                2
#>  3 yes          36 travel_rarely         1218 sales                     9
#>  4 yes          34 travel_rarely          699 research_…                6
#>  5 no           53 travel_rarely         1282 research_…                5
#>  6 yes          24 travel_rarely          813 research_…                1
#>  7 no           36 travel_rarely          852 research_…                5
#>  8 no           27 travel_rarely         1240 research_…                2
#>  9 yes          41 travel_rarely         1360 research_…               12
#> 10 yes          48 travel_rarely          626 research_…                1
#> # … with 283 more rows, and 29 more variables: education <dbl>,
#> #   education_field <chr>, employee_count <dbl>, employee_number <dbl>,
#> #   environment_satisfaction <dbl>, gender <chr>, hourly_rate <dbl>,
#> #   job_involvement <dbl>, job_level <dbl>, job_role <chr>,
#> #   job_satisfaction <dbl>, marital_status <chr>, monthly_income <dbl>,
#> #   monthly_rate <dbl>, num_companies_worked <dbl>, over_18 <chr>,
#> #   over_time <chr>, percent_salary_hike <dbl>, performance_rating <dbl>,
#> #   relationship_satisfaction <dbl>, standard_hours <dbl>,
#> #   stock_option_level <dbl>, total_working_years <dbl>,
#> #   training_times_last_year <dbl>, work_life_balance <dbl>,
#> #   years_at_company <dbl>, years_in_current_role <dbl>,
#> #   years_since_last_promotion <dbl>, years_with_curr_manager <dbl>

But sometimes, simple random sampling is not enough:

This is where we need the strata argument to perform stratified random sampling:
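A minimal sketch of a stratified split, reusing the splitted name and prop value from the earlier example:

```r
# load rsample for the splitting helpers
library(rsample)

# set seed for reproducibility
set.seed(100)

# stratify by the target so train and test keep similar
# proportions of "yes" and "no" in `attrition`
splitted <- initial_split(attrition, prop = 0.8, strata = "attrition")
```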

Cross-sectional resampling

To use k-fold cross-validation splits–and optionally, with repetition–we could use vfold_cv():
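The resampled object used below could be created along these lines; the v (and optional repeats) values here are illustrative assumptions:

```r
# load rsample for the resampling helpers
library(rsample)

# set seed for reproducible fold assignment
set.seed(100)

# 3-fold cross-validation; add `repeats` for repeated k-fold,
# and `strata` for stratified folds
resampled <- vfold_cv(attrition, v = 3, strata = "attrition")
```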

Each train and test dataset is stored in the splits column. We could use analysis() and assessment() to get the train and test dataset, respectively:

# check train dataset on an example split
analysis(resampled$splits[[1]])
#> # A tibble: 980 x 35
#>    attrition   age business_travel daily_rate department distance_from_h…
#>    <chr>     <dbl> <chr>                <dbl> <chr>                 <dbl>
#>  1 no           49 travel_frequen…        279 research_…                8
#>  2 yes          37 travel_rarely         1373 research_…                2
#>  3 no           33 travel_frequen…       1392 research_…                3
#>  4 no           27 travel_rarely          591 research_…                2
#>  5 no           59 travel_rarely         1324 research_…                3
#>  6 no           36 travel_rarely         1299 research_…               27
#>  7 no           29 travel_rarely          153 research_…               15
#>  8 no           31 travel_rarely          670 research_…               26
#>  9 no           34 travel_rarely         1346 research_…               19
#> 10 no           22 non_travel            1123 research_…               16
#> # … with 970 more rows, and 29 more variables: education <dbl>,
#> #   education_field <chr>, employee_count <dbl>, employee_number <dbl>,
#> #   environment_satisfaction <dbl>, gender <chr>, hourly_rate <dbl>,
#> #   job_involvement <dbl>, job_level <dbl>, job_role <chr>,
#> #   job_satisfaction <dbl>, marital_status <chr>, monthly_income <dbl>,
#> #   monthly_rate <dbl>, num_companies_worked <dbl>, over_18 <chr>,
#> #   over_time <chr>, percent_salary_hike <dbl>, performance_rating <dbl>,
#> #   relationship_satisfaction <dbl>, standard_hours <dbl>,
#> #   stock_option_level <dbl>, total_working_years <dbl>,
#> #   training_times_last_year <dbl>, work_life_balance <dbl>,
#> #   years_at_company <dbl>, years_in_current_role <dbl>,
#> #   years_since_last_promotion <dbl>, years_with_curr_manager <dbl>

# check test dataset on an example split
assessment(resampled$splits[[1]])
#> # A tibble: 490 x 35
#>    attrition   age business_travel daily_rate department distance_from_h…
#>    <chr>     <dbl> <chr>                <dbl> <chr>                 <dbl>
#>  1 yes          41 travel_rarely         1102 sales                     1
#>  2 no           32 travel_frequen…       1005 research_…                2
#>  3 no           30 travel_rarely         1358 research_…               24
#>  4 no           38 travel_frequen…        216 research_…               23
#>  5 no           35 travel_rarely          809 research_…               16
#>  6 yes          28 travel_rarely          103 research_…               24
#>  7 no           29 travel_rarely         1389 research_…               21
#>  8 no           32 travel_rarely          334 research_…                5
#>  9 no           38 travel_rarely          371 research_…                2
#> 10 yes          36 travel_rarely         1218 sales                     9
#> # … with 480 more rows, and 29 more variables: education <dbl>,
#> #   education_field <chr>, employee_count <dbl>, employee_number <dbl>,
#> #   environment_satisfaction <dbl>, gender <chr>, hourly_rate <dbl>,
#> #   job_involvement <dbl>, job_level <dbl>, job_role <chr>,
#> #   job_satisfaction <dbl>, marital_status <chr>, monthly_income <dbl>,
#> #   monthly_rate <dbl>, num_companies_worked <dbl>, over_18 <chr>,
#> #   over_time <chr>, percent_salary_hike <dbl>, performance_rating <dbl>,
#> #   relationship_satisfaction <dbl>, standard_hours <dbl>,
#> #   stock_option_level <dbl>, total_working_years <dbl>,
#> #   training_times_last_year <dbl>, work_life_balance <dbl>,
#> #   years_at_company <dbl>, years_in_current_role <dbl>,
#> #   years_since_last_promotion <dbl>, years_with_curr_manager <dbl>

Data Preprocess using recipes

Rethinking: How should we treat train and test?

Reproducible preprocess recipes

recipes is part of tidymodels that could help us in making a reproducible data preprocessing pipeline.

There are so many data preprocessing approaches provided by recipes–as you could see in its full function reference page. In this introduction, we will use several preprocessing steps related to our example dataset.

There are several steps that we could apply to our dataset–some are very fundamental and sometimes mandatory, but some are also tuneable:

  • step_rm(): Manually remove unused columns.
  • step_nzv(): Automatically filter out near-zero variance columns.
  • step_string2factor(): Manually convert string columns to factors.
  • step_downsample(): Downsample to balance the target’s class distribution (tuneable).
  • step_center(): Normalize the mean of numeric column(s) to zero (tuneable).
  • step_scale(): Normalize the standard deviation of numeric column(s) to one (tuneable).
  • step_pca(): Shrink numeric column(s) into several PCA components (tuneable).

Designing your first preprocess recipes

  1. Initiate a recipe using recipe()
    • Define your formula in the first argument.
    • Supply a template dataset in data argument.
  2. Pipe to every needed step_*()–always remember to put the steps in the proper order.
  3. After the last step_*(), pipe to the prep() function to train your recipe:
    • It will automatically convert all character columns to factors;
    • If you used step_string2factor(), don’t forget to specify strings_as_factors = FALSE;
    • It will train the recipe on the dataset specified in recipe()’s data argument;
    • If you want to train on another dataset, supply the new dataset to the training argument and set the fresh argument to TRUE.

Let’s see an example of defining a recipe:
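A sketch of such a recipe, combining the steps listed earlier; the columns removed in step_rm() and the number of PCA components are assumptions, not necessarily the choices used in the slides:

```r
# load recipes for the preprocessing helpers
library(recipes)

rec <- recipe(attrition ~ ., data = training(splitted)) %>%
  step_rm(employee_count, employee_number) %>%  # drop ID-like columns (assumed)
  step_nzv(all_predictors()) %>%                # drop near-zero variance columns
  step_string2factor(all_nominal()) %>%         # strings -> factors
  step_downsample(attrition) %>%                # balance the target classes
  step_center(all_numeric()) %>%                # center numeric columns to mean 0
  step_scale(all_numeric()) %>%                 # scale numeric columns to sd 1
  step_pca(all_numeric(), num_comp = 14) %>%    # shrink to 14 components (assumed)
  prep(strings_as_factors = FALSE)              # train the recipe
```

Note that in recent recipes versions, step_downsample() has moved to the themis package.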

There are two ways of obtaining the result from our recipe:

  • juice(): Extract the preprocessed dataset from a prep()-ed recipe. Normally, we will use this function to get the preprocessed train dataset.

  • bake(): Apply a recipe to a new dataset. Normally, we will use this function to preprocess new datasets, such as the test dataset, or a prediction dataset.

Here are some examples of how to get the preprocessed train and test dataset:

# get preprocessed train dataset
data_train <- juice(rec)

# quick check
data_train
#> # A tibble: 380 x 22
#>    business_travel department education_field gender job_role
#>    <fct>           <fct>      <fct>           <fct>  <fct>   
#>  1 travel_rarely   sales      life_sciences   female sales_e…
#>  2 travel_rarely   research_… other           male   laborat…
#>  3 travel_rarely   research_… life_sciences   male   laborat…
#>  4 travel_rarely   research_… medical         male   researc…
#>  5 travel_frequen… research_… life_sciences   female researc…
#>  6 travel_rarely   sales      technical_degr… male   sales_r…
#>  7 travel_rarely   research_… medical         male   researc…
#>  8 travel_rarely   sales      marketing       male   sales_r…
#>  9 travel_rarely   research_… technical_degr… female researc…
#> 10 travel_rarely   research_… life_sciences   male   laborat…
#> # … with 370 more rows, and 17 more variables: marital_status <fct>,
#> #   over_time <fct>, attrition <fct>, PC01 <dbl>, PC02 <dbl>, PC03 <dbl>,
#> #   PC04 <dbl>, PC05 <dbl>, PC06 <dbl>, PC07 <dbl>, PC08 <dbl>,
#> #   PC09 <dbl>, PC10 <dbl>, PC11 <dbl>, PC12 <dbl>, PC13 <dbl>, PC14 <dbl>

# get preprocessed test dataset
data_test <- bake(rec, testing(splitted))

# quick check
data_test
#> # A tibble: 293 x 22
#>    attrition business_travel department education_field gender job_role
#>    <fct>     <fct>           <fct>      <fct>           <fct>  <fct>   
#>  1 no        travel_rarely   research_… life_sciences   female manufac…
#>  2 no        non_travel      research_… other           female manufac…
#>  3 yes       travel_rarely   sales      life_sciences   male   sales_r…
#>  4 no        travel_rarely   research_… other           female manager 
#>  5 no        travel_rarely   sales      marketing       male   sales_e…
#>  6 no        travel_rarely   sales      marketing       female sales_r…
#>  7 no        travel_rarely   research_… other           male   laborat…
#>  8 yes       travel_rarely   research_… life_sciences   male   laborat…
#>  9 no        travel_frequen… research_… medical         female laborat…
#> 10 no        travel_rarely   research_… life_sciences   male   researc…
#> # … with 283 more rows, and 16 more variables: marital_status <fct>,
#> #   over_time <fct>, PC01 <dbl>, PC02 <dbl>, PC03 <dbl>, PC04 <dbl>,
#> #   PC05 <dbl>, PC06 <dbl>, PC07 <dbl>, PC08 <dbl>, PC09 <dbl>,
#> #   PC10 <dbl>, PC11 <dbl>, PC12 <dbl>, PC13 <dbl>, PC14 <dbl>

Model Fitting using parsnip

Rethinking: How many machine learning packages have you used?

One vegeta.. I mean package to rule them all: parsnip

parsnip is part of tidymodels that could help us in model fitting and prediction flows.

There are so many models supported by parsnip–as you could see in its full model list. In this introduction, we will use random forest as an example model.

There are two parts of defining a model that should be noted:

  • Defining the model’s specification: In this part, we need to define the model’s specification, such as mtry and trees for a random forest, through model-specific functions. For example, you can use rand_forest() to define a random forest specification. Make sure to check the full model list to see every model and its available arguments.

  • Defining the model’s engine: In this part, we need to define the model’s engine, which determines the package we will use to fit our model. This part could be done using the set_engine() function. Note that in addition to defining which package we want to use as our engine, we could also pass package-specific arguments to this function.

This is an example of defining a random forest model using randomForest::randomForest() as our engine:
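A sketch of such a specification; the mtry, trees, and min_n values are placeholders, not the ones used in the slides:

```r
# load parsnip for the unified model interface
library(parsnip)

# random forest specification, fitted later via randomForest::randomForest()
model_spec <- rand_forest(
  mode  = "classification",
  mtry  = 4,     # predictors sampled at each split (placeholder)
  trees = 500,   # number of trees (placeholder)
  min_n = 15     # minimum node size (placeholder)
) %>%
  set_engine("randomForest")
```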

To fit our model, we have two options:

  • Formula interface
  • X-Y interface

Note that some packages behave differently under the formula and x-y interfaces. For example, randomForest::randomForest() would convert all of our categorical variables into dummy variables under the formula interface, but not under the x-y interface.

Fit via the formula interface using the fit() function:

Or via the x-y interface using the fit_xy() function:
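Both interfaces could be sketched as follows, assuming the data_train tibble from the recipes section and a model_spec object as defined above:

```r
# load dplyr for select()
library(dplyr)

# formula interface: outcome and predictors declared in one formula
model_fit <- fit(model_spec, attrition ~ ., data = data_train)

# x-y interface: predictors and outcome passed separately
model_fit_xy <- fit_xy(
  model_spec,
  x = select(data_train, -attrition),
  y = data_train$attrition
)
```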

In this workflow, it should be relatively easy to change the model engine.

Let’s try to fit the same model specification, but now using ranger::ranger():
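A sketch of the engine swap; the seed argument is an illustrative ranger-specific argument, passed through set_engine():

```r
# same specification, different engine; extra arguments in set_engine()
# are forwarded to ranger::ranger()
model_spec_ranger <- rand_forest(
  mode  = "classification",
  mtry  = 4,
  trees = 500,
  min_n = 15
) %>%
  set_engine("ranger", seed = 100)
```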

Now let’s try to fit the model, and see the result:

Notice that ranger::ranger() doesn’t behave differently between fit() and fit_xy():

To get the prediction, we could use predict() as usual–but note that it would return a tidy tibble, instead of a vector (as in the type = "class" case) or a raw data.frame (as in the type = "prob" case).

In this way, the prediction results would be very convenient for further usage, such as recombining with the original dataset:
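For example, binding class and probability predictions back onto the test set; pred_test and model_fit are assumed object names:

```r
# load dplyr for select() and bind_cols()
library(dplyr)

# parsnip returns a `.pred_class` column for type = "class" and one
# `.pred_<level>` column per class for type = "prob"
pred_test <- data_test %>%
  select(attrition) %>%
  bind_cols(
    predict(model_fit, data_test, type = "class"),
    predict(model_fit, data_test, type = "prob")
  )
```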

Model Evaluation using yardstick

Rethinking: How do we measure the goodness of our model?

It’s always good to bring your own yardstick

yardstick is part of tidymodels that could help us in calculating model performance metrics.

There are so many metrics available in yardstick–as you could see in its full function reference page. In this introduction, we will calculate some model performance metrics for a classification task as an example.

There are two ways of calculating model performance metrics, which differ in their inputs and outputs:

  • tibble approach: We pass a dataset containing the truth and estimate columns to the function, and it will return a tibble containing the results, e.g., the precision() function.
  • vector approach: We pass one vector as the truth and one vector as the estimate to the function, and it will return a vector showing the results, e.g., the precision_vec() function.

Note that some functions, like conf_mat(), only accept the tibble approach, since they do not return a vector of length one.

Let’s start by reporting the confusion matrix:
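A sketch, assuming a prediction tibble pred_test with the truth (attrition) and estimate (.pred_class) columns:

```r
# load yardstick for the metric functions
library(yardstick)

# cross-tabulate predicted vs. actual classes
conf_mat(pred_test, truth = attrition, estimate = .pred_class)
```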

Now, to calculate the performance metrics, let’s try to use the tibble approach–which also supports group_by():

Or using the vector approach, which is more flexible in general:
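The two approaches side by side, again assuming the pred_test tibble:

```r
# tibble approach: data frame in, tibble of results out
precision(pred_test, truth = attrition, estimate = .pred_class)

# vector approach: plain vectors in, a single number out
precision_vec(pred_test$attrition, pred_test$.pred_class)
```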

Sometimes the model performance metrics could also help in improving the model’s final results. For example, through the Receiver Operating Characteristic (ROC) curve, we could assess the effect of the probability threshold on sensitivity and specificity:
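A sketch using roc_curve(), which returns one row per probability threshold; .pred_yes is the assumed probability column, and the event level may need adjusting since "yes" is the second factor level:

```r
# ROC curve data: one row per threshold, with the resulting
# specificity and sensitivity at that threshold
roc_curve(pred_test, attrition, .pred_yes)
```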

And, since it returns a tibble, we could do further data wrangling to help us see it more clearly:

With some ggplot2:

and plotly magic, it would be perfect:
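The ggplot2 part could be sketched like this; plotly::ggplotly() could then wrap the plot for interactivity:

```r
# load ggplot2 for plotting
library(ggplot2)

# classic ROC plot: sensitivity vs. 1 - specificity, with a
# dashed diagonal as the random-guess baseline
roc_curve(pred_test, attrition, .pred_yes) %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  geom_path() +
  geom_abline(lty = 2) +
  coord_equal()
```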

Dive Deeper: Summing-up the Baseline Workflow

Cross-validation scheme

Data preprocess

Model fitting

Model evaluation

Intermediate Workflow: Hyperparameter Tuning

Please check out https://github.com/bagasbgy/tidymodels-examples, on the classification-rand_forest branch.

Learn-by-building: Hyperparameter Tuning an xgboost Model

With:

  • loss_reduction held constant at 0.01
  • learn_rate held constant at 0.01
  • sample_size held constant at 0.8

Choose the best model that gives accuracy, sensitivity, and specificity above 70%!

Happy Learning and Coding! 😄